March, 2016

In Graphic Detail

Data Science in Application Performance Management

Bill Kayser

Distinguished Engineer, New Relic

Bill Kayser

  • Founding engineer at New Relic
  • Software architect specializing in web applications
  • Working in APM for 10 years
  • Data Science, Machine Learning, and Visualization

Congratulations!

Now What?

What's the average time to service a request?

All good?

Maybe this will help.

Let's look at recent history.

Median

Quartiles, 25% and 75%

…95% and 99%

Throughput

Histograms

Histogram with Summary Values

Combining Histogram with Timeseries

Scatterplot of Each Request

Using Everything at Once

Estimating the Value of Data

What sort of insight does it provide?

  • Does it quantify something we care about?
  • Does it give us a qualitative assessment of the current status?
  • Does it help us measure differences across similar things, like applications or servers?
  • Does it help us identify trends and anomalies?
  • Does it reveal underlying patterns or relationships in the data?

Estimating the Cost

How cheaply can we collect the data?

  • Is space or bandwidth at a premium?
  • Can it be reduced by combining values into a single value, or do we have to store every measurement?
  • How much screen real estate does it take up?

The smaller the data, the more things you can measure.

The less space it takes, the more options you have for screen layout.

Reducing Data

Can you reduce your measurements in \(X\), \(x_1, \dots, x_n\), by combining values using a reduce function \(f(a,b)\)?

\[ X_{ab} = f(a, b)\] \[ X_{cd} = f(c, d)\] \[ X_{abcd} = f(X_{ab}, X_{cd})\]

Space required grows linearly with the length of the history, as in a simple time series.

  • Min and Max: \(\min(X) = \min\left(\min(x_1, x_2), \min(x_3, x_4), \dots, \min(x_{n-1}, x_n)\right)\)
  • Sum: \(\sum_{i=1}^n x_i = (x_1 + x_2) + (x_3 + x_4) + \dots + (x_{n-1} + x_n)\)
  • Mean: \(\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i\)
  • Standard Deviation: \(\sigma = \sqrt{\frac{\sum_{i=1}^n x_i^2 - n\bar{x}^2}{n - 1}}\)
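
The reducible statistics listed above can all be derived from one fixed-size record that merges pairwise. The following is a minimal sketch, not a reference implementation; the `Summary` class and its field names are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Summary:
    """Fixed-size summary of a batch of measurements (hypothetical name)."""
    count: int
    total: float     # sum of x
    total_sq: float  # sum of x squared
    lo: float        # minimum
    hi: float        # maximum

    @staticmethod
    def of(x: float) -> "Summary":
        # Summary of a single measurement.
        return Summary(1, x, x * x, x, x)

    def merge(self, other: "Summary") -> "Summary":
        # Plays the role of the reduce function f(a, b); the grouping of
        # the merges does not change the result beyond float rounding.
        return Summary(
            self.count + other.count,
            self.total + other.total,
            self.total_sq + other.total_sq,
            min(self.lo, other.lo),
            max(self.hi, other.hi),
        )

    @property
    def mean(self) -> float:
        return self.total / self.count

    @property
    def stddev(self) -> float:
        # Sample standard deviation from the sums; needs count >= 2.
        var = (self.total_sq - self.count * self.mean ** 2) / (self.count - 1)
        return math.sqrt(max(var, 0.0))  # clamp tiny negative rounding error


# X_ab and X_cd are reduced separately, then combined into X_abcd.
x_ab = reduce(Summary.merge, map(Summary.of, [2.0, 4.0]))
x_cd = reduce(Summary.merge, map(Summary.of, [4.0, 6.0]))
x_abcd = x_ab.merge(x_cd)
```

Each time period needs only five numbers regardless of how many requests it covers, which is why storage grows with the length of the history and not with traffic.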

Hard To Reduce Data

Examples:

  • Histograms (keep N buckets for each time period)
  • Median
  • Percentiles

The amount of storage required depends on the length of history but also the granularity of the data.

You need N buckets in each time period.
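
As a rough sketch of what keeping N buckets per time period looks like, the code below records response times into fixed-width buckets, merges two periods, and reads an approximate percentile back out. The bucket width, the bucket count, and the use of the last bucket for overflow are assumptions made for this example only.

```python
from typing import List

BUCKET_WIDTH_MS = 100  # assumed bucket width
NUM_BUCKETS = 10       # assumed N; the last bucket also holds overflow


def new_histogram() -> List[int]:
    return [0] * NUM_BUCKETS


def record(hist: List[int], response_ms: float) -> None:
    # Clamp into the valid range so slow outliers land in the last bucket.
    index = int(response_ms // BUCKET_WIDTH_MS)
    hist[max(0, min(index, NUM_BUCKETS - 1))] += 1


def merge(a: List[int], b: List[int]) -> List[int]:
    # Histograms with identical bucket boundaries combine bucket by bucket.
    return [x + y for x, y in zip(a, b)]


def percentile(hist: List[int], p: float) -> float:
    """Upper edge (ms) of the bucket holding the p-th percentile, 0 < p <= 100."""
    total = sum(hist)
    if total == 0:
        raise ValueError("empty histogram")
    target = total * p / 100.0
    running = 0
    for index, count in enumerate(hist):
        running += count
        if running >= target:
            return float((index + 1) * BUCKET_WIDTH_MS)
    return float(NUM_BUCKETS * BUCKET_WIDTH_MS)


minute_1, minute_2 = new_histogram(), new_histogram()
for ms in (50, 150, 250):
    record(minute_1, ms)
for ms in (150, 950, 5000):
    record(minute_2, ms)
both = merge(minute_1, minute_2)
median_ms = percentile(both, 50)  # 200.0: the median falls in the 100-200 ms bucket
```

The answer is only as precise as the bucket width, and the edge reported for the overflow bucket is nominal. Finer buckets give better percentiles but multiply the storage needed for every time period.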

Irreducible Data

Examples:

  • Scatterplots
  • Replays

Requires keeping a complete history of every measurement.

Space is proportional to the number of measurements.

Cost/Benefit Grid

Cost/Benefit - Mean Statistic

Cost/Benefit - Scatterplot of Every Request

Cost/Benefit - Histograms

Can we make use of the Standard Deviation?

Estimating Quartiles Using Standard Deviation

If response times had a normal distribution we could show everything using simple statistics and estimated percentiles:
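
Under that normality assumption, percentiles follow directly from the mean and standard deviation. A minimal sketch using Python's standard library (the function names are hypothetical); it is only meaningful when the data really is close to normal:

```python
from statistics import NormalDist
from typing import Tuple


def normal_percentile(mean: float, stddev: float, p: float) -> float:
    """Estimated p-th percentile (0 < p < 100) of a normal distribution."""
    # inv_cdf requires stddev > 0 and a probability strictly between 0 and 1.
    return NormalDist(mean, stddev).inv_cdf(p / 100.0)


def normal_inner_quartiles(mean: float, stddev: float) -> Tuple[float, float]:
    """Estimated 25th and 75th percentiles, roughly mean -/+ 0.674 * stddev."""
    return normal_percentile(mean, stddev, 25), normal_percentile(mean, stddev, 75)


# Example: mean response time of 200 ms with a standard deviation of 50 ms.
q1, q3 = normal_inner_quartiles(200.0, 50.0)
```

For these inputs the quartiles land about 33.7 ms on either side of the mean.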

Estimating Quartiles

Summary Statistics Don't Work on Non-Gaussian Data

Cost/Benefit - Standard Deviation

Mean vs Median

Non-Gaussian distributions

Cost/Benefit - Median

Histograms

Multi-modal Histograms

Histogram Example

Gamma Distribution Approximation

Log Normal Approximation

Response Times in Log Space

Calculating GM and GSD

Geometric Mean

Instead of collecting the sum of the response times, you collect the sum of the log of the response times:

\(GM = \exp\left(\frac{1}{n}\sum_{i=1}^n \ln(t_i)\right)\)

Geometric Standard Deviation

Instead of collecting the sum of squares to calculate the standard deviation, you collect the sum of the squared logs of the response times:

\(GSD = \exp\left(\sqrt{\frac{1}{n}\sum_{i=1}^n \left(\ln t_i\right)^2 - \left(\frac{1}{n}\sum_{i=1}^n \ln t_i\right)^2}\right)\)

Inner Quartile Interval Estimate

\(\left[\frac{GM}{GSD^z},\; GM \times GSD^z\right]\)

…where \(z = 0.674\) from the Standard Normal Distribution.
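
The three formulas above can be sketched as follows. This is an illustrative outline under the stated log-normal assumption, with hypothetical function names; it keeps only the count, the sum of logs, and the sum of squared logs, so the running sums can be merged across time periods by simple addition.

```python
import math
from typing import Iterable, Tuple

Z_QUARTILE = 0.674  # z for the 25th/75th percentiles, as given above


def log_sums(times: Iterable[float]) -> Tuple[int, float, float]:
    """Return (n, sum of ln t, sum of (ln t)^2); every t must be > 0."""
    n, s, s2 = 0, 0.0, 0.0
    for t in times:
        lt = math.log(t)
        n += 1
        s += lt
        s2 += lt * lt
    return n, s, s2


def geometric_stats(n: int, s: float, s2: float) -> Tuple[float, float]:
    """Geometric mean and geometric standard deviation from the log sums."""
    mean_log = s / n
    var_log = max(s2 / n - mean_log ** 2, 0.0)  # clamp rounding error
    return math.exp(mean_log), math.exp(math.sqrt(var_log))


def inner_quartile_estimate(gm: float, gsd: float,
                            z: float = Z_QUARTILE) -> Tuple[float, float]:
    """Estimated [25th, 75th] percentile interval: [GM / GSD^z, GM * GSD^z]."""
    spread = gsd ** z
    return gm / spread, gm * spread


# Example with made-up response times in milliseconds.
gm, gsd = geometric_stats(*log_sums([120.0, 95.0, 310.0, 150.0, 2200.0]))
low, high = inner_quartile_estimate(gm, gsd)
```

Because the geometric mean of a log-normal sample estimates its median, `gm` stays close to the typical request even when a few slow outliers pull the arithmetic mean upward.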

Cost/Benefit - Geometric Mean and Median

Measuring User Experience with Apdex

What is Apdex?

  • An Apdex score is a standardized measure indicating whether your site is meeting its obligations
  • Values from 0 (total failure) to 1 (every request satisfied), or 0 to 100 when shown as a percentage
  • Scored based on responses bucketed according to response time thresholds:
    • Satisfied
    • Tolerating
    • Failed
  • Timeouts and errors are automatically placed in the Failed bin.

What is Apdex?

The Apdex Score is determined by the formula:

\[\frac{N_s + \frac{1}{2} N_t}{N}\]

Where:

  • \(N_s\) = The number of requests completed in no more than \(T\) seconds
  • \(N_t\) = The number of requests completed between \(T\) and \(4T\)
  • \(N_f\) = The number of requests taking longer than \(4T\), including requests that had errors or timed out. These count toward \(N\) but contribute nothing to the numerator.
  • \(N\) = \(N_s + N_t + N_f\) = the total number of requests processed.

\(T\) is referred to as the Apdex T parameter and is set individually for each application based on what is considered a satisfactory response time from a business perspective.
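
The formula translates almost directly into code. The sketch below is illustrative only: the function name is hypothetical, errors and timeouts are represented as `None` by assumption, and a request taking exactly \(4T\) is treated as Tolerating, which the definitions above leave open. It returns the raw value of the formula, a fraction between 0 and 1.

```python
from typing import Iterable, Optional


def apdex(response_times: Iterable[Optional[float]], t: float) -> float:
    """Apdex score for response times in seconds; None marks an error or timeout."""
    satisfied = tolerating = total = 0
    for rt in response_times:
        total += 1
        if rt is None:
            continue  # error or timeout: counted in N only (Failed)
        if rt <= t:
            satisfied += 1    # N_s: completed within T
        elif rt <= 4 * t:
            tolerating += 1   # N_t: between T and 4T
        # anything slower than 4T is Failed and adds nothing
    if total == 0:
        raise ValueError("no requests to score")
    return (satisfied + tolerating / 2) / total


# Example with T = 0.5 s: four satisfied, two tolerating, two failed.
score = apdex([0.2, 0.4, 0.5, 1.0, 1.9, 2.5, None, 0.1], t=0.5)  # 0.625
```

Only three counters are needed per time period, so Apdex is as cheap to collect and merge as a sum.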

Configuring Apdex T

Apdex Buttons

Apdex scores for a group of apps can be enumerated with gauges, but a richer visual is the "Apdex Button".

  • Each application's status is represented with its own "Traffic Light"
  • Green → Good
  • Red → Bad

Apdex Buttons

Apdex vs Mean

Cost/Benefit - Apdex

Additional Visualizations

Faceted Views

Looking at a collection of related things

  • Cluster of servers
  • Application server nodes
  • Pages in an application
  • Response times broken down by browser

Sparklines - Mean

Sparklines - Inner Quartiles (est.)

Sparklines - Inner Quartiles, Free Y Axis

Sparklines - Horizon Plots

Sparklines - Frequency Plots

Sparklines - Frequency + Rug

Sparklines - Density Filaments

Box Plots

Throughput vs Response Time

Plots of throughput on the X axis and latency on the Y axis. Color is time of day.

Animation

Summary - Monitoring Latency

When you can collect a lot of data

  • Use Scatterplots for deep visibility
  • Use Median, Inner Quartiles for plots and sparklines
  • Use histograms to identify multi-modal behavior
  • Collect multiple dimensions for faceted views when possible

Summary - Monitoring Latency

When space is at a premium

  • Be careful using the mean
  • Use Geometric Mean to approximate the median
  • Use Geometric Standard Deviation to approximate inner quartile regions
  • Use Apdex for monitoring the quality of the customer experience

Consider alternate visualizations

  • Throughput vs Response Time Scatterplots
  • Density Filaments
  • Horizon Plots
  • Animation

Thanks!